Unsupervised Learning: Trade&Ahead

Marks: 60

Context

The stock market has consistently proven to be a good place to invest in and save for the future. There are a lot of compelling reasons to invest in stocks. It can help in fighting inflation, create wealth, and also provides some tax benefits. Good steady returns on investments over a long period of time can also grow a lot more than seems possible. Also, thanks to the power of compound interest, the earlier one starts investing, the larger the corpus one can have for retirement. Overall, investing in stocks can help meet life's financial aspirations.

It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. Having a diversified portfolio tends to yield higher returns and face lower risk by tempering potential losses when the market is down. It is often easy to get lost in a sea of financial metrics to analyze while determining the worth of a stock, and doing the same for a multitude of stocks to identify the right picks for an individual can be a tedious task. By doing a cluster analysis, one can identify stocks that exhibit similar characteristics and ones which exhibit minimum correlation. This will help investors better analyze stocks across different market segments and help protect against risks that could make the portfolio vulnerable to losses.

Objective

Trade&Ahead is a financial consultancy firm who provide their customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock price and some financial indicators for a few companies listed under the New York Stock Exchange. They have assigned you the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.

Data Dictionary

  • Ticker Symbol: An abbreviation used to uniquely identify publicly traded shares of a particular stock on a particular stock market
  • Company: Name of the company
  • GICS Sector: The specific economic sector assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
  • GICS Sub Industry: The specific sub-industry group assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
  • Current Price: Current stock price in dollars
  • Price Change: Percentage change in the stock price in 13 weeks
  • Volatility: Standard deviation of the stock price over the past 13 weeks
  • ROE: A measure of financial performance calculated by dividing net income by shareholders' equity (shareholders' equity is equal to a company's assets minus its debt)
  • Cash Ratio: The ratio of a company's total reserves of cash and cash equivalents to its total current liabilities
  • Net Cash Flow: The difference between a company's cash inflows and outflows (in dollars)
  • Net Income: Revenues minus expenses, interest, and taxes (in dollars)
  • Earnings Per Share: Company's net profit divided by the number of common shares it has outstanding (in dollars)
  • Estimated Shares Outstanding: Company's stock currently held by all its shareholders
  • P/E Ratio: Ratio of the company's current stock price to the earnings per share
  • P/B Ratio: Ratio of the company's stock price per share by its book value per share (book value of a company is the net difference between that company's total assets and total liabilities)

Importing necessary libraries and data

In [1]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style='darkgrid')

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# to scale the data using z-score
from sklearn.preprocessing import StandardScaler

# to compute distances
from scipy.spatial.distance import cdist, pdist

# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# to suppress warnings
import warnings
warnings.filterwarnings("ignore")

Load dataset

In [2]:
#Connecting Google drive with Google colab
# Reading the data-set into Google colab
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [3]:
#Reading the "stock_data.csv" dataset into a dataframe (i.e.loading the data)
path="/content/drive/My Drive/stock_data.csv"
data = pd.read_csv(path)

Data Overview

  • Observations
  • Sanity checks -These can be achieved by doing the following;
  1. Viewing the first and last few rows of the dataset
  2. Checking the shape of the dataset
  3. Ensuring that the data is stored in the correct format, it's important to identify the data types.
  4. Getting the statistical summary for the variables.
  5. Checking for missing values.
  6. Checking for duplicates

View the first and last 5 few rows of the dataset

In [4]:
# returning the first 5 rows using the dataframe head method
data.head()
Out[4]:
Ticker Symbol Security GICS Sector GICS Sub Industry Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
0 AAL American Airlines Group Industrials Airlines 42.349998 9.999995 1.687151 135 51 -604000000 7610000000 11.39 6.681299e+08 3.718174 -8.784219
1 ABBV AbbVie Health Care Pharmaceuticals 59.240002 8.339433 2.197887 130 77 51000000 5144000000 3.15 1.633016e+09 18.806350 -8.750068
2 ABT Abbott Laboratories Health Care Health Care Equipment 44.910000 11.301121 1.273646 21 67 938000000 4423000000 2.94 1.504422e+09 15.275510 -0.394171
3 ADBE Adobe Systems Inc Information Technology Application Software 93.940002 13.977195 1.357679 9 180 -240840000 629551000 1.26 4.996437e+08 74.555557 4.199651
4 ADI Analog Devices, Inc. Information Technology Semiconductors 55.320000 -1.827858 1.701169 14 272 315120000 696878000 0.31 2.247994e+09 178.451613 1.059810
In [5]:
# returning the last 5 rows using dataframe tail method
data.tail()
Out[5]:
Ticker Symbol Security GICS Sector GICS Sub Industry Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
335 YHOO Yahoo Inc. Information Technology Internet Software & Services 33.259998 14.887727 1.845149 15 459 -1032187000 -4359082000 -4.64 939457327.6 28.976191 6.261775
336 YUM Yum! Brands Inc Consumer Discretionary Restaurants 52.516175 -8.698917 1.478877 142 27 159000000 1293000000 2.97 435353535.4 17.682214 -3.838260
337 ZBH Zimmer Biomet Holdings Health Care Health Care Equipment 102.589996 9.347683 1.404206 1 100 376000000 147000000 0.78 188461538.5 131.525636 -23.884449
338 ZION Zions Bancorp Financials Regional Banks 27.299999 -1.158588 1.468176 4 99 -43623000 309471000 1.20 257892500.0 22.749999 -0.063096
339 ZTS Zoetis Health Care Pharmaceuticals 47.919998 16.678836 1.610285 32 65 272000000 339000000 0.68 498529411.8 70.470585 1.723068

Checking the shape of the dataset

In [6]:
#checking shape of the dataframe to find out the number of rows and columns using the dataframe shape command
print("There are", data.shape[0], 'rows and', data.shape[1], "columns.")
There are 340 rows and 15 columns.

Checking the columns data types for the dataset

In [7]:
# Using the dataframe info() method to print a concise summary of the DataFrame
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Ticker Symbol                 340 non-null    object 
 1   Security                      340 non-null    object 
 2   GICS Sector                   340 non-null    object 
 3   GICS Sub Industry             340 non-null    object 
 4   Current Price                 340 non-null    float64
 5   Price Change                  340 non-null    float64
 6   Volatility                    340 non-null    float64
 7   ROE                           340 non-null    int64  
 8   Cash Ratio                    340 non-null    int64  
 9   Net Cash Flow                 340 non-null    int64  
 10  Net Income                    340 non-null    int64  
 11  Earnings Per Share            340 non-null    float64
 12  Estimated Shares Outstanding  340 non-null    float64
 13  P/E Ratio                     340 non-null    float64
 14  P/B Ratio                     340 non-null    float64
dtypes: float64(7), int64(4), object(4)
memory usage: 40.0+ KB

Observation

  • The dataset contains 15 series (columns) of which four of the series is of the object datatype (Ticker Symbol,Security,GICS Sector and GICS Sub Industry), seven(7) of the series are of the float datatype (Current Price,Price Change, Volatility, Earnings Per Share, Estimated Shares Outstanding,P/E Ratio, and P/B Ratio), while four(4) of the series are of the integer datatype (ROE, Cash Ratio, Net Cash Flow and Net Income).
  • All of these dtypes are appropriate for their respective columns

  • Total memory usage is approximately 40.0+ KB.

Creating a copy of original data

In [8]:
# copying the data to another variable to avoid any changes to original data
df = data.copy()

Getting the statistical summary for the variables.

In [9]:
# checking the statistical summary of the data using describe command and transposing.
df.describe(include='all').T
Out[9]:
count unique top freq mean std min 25% 50% 75% max
Ticker Symbol 340 340 AAL 1 NaN NaN NaN NaN NaN NaN NaN
Security 340 340 American Airlines Group 1 NaN NaN NaN NaN NaN NaN NaN
GICS Sector 340 11 Industrials 53 NaN NaN NaN NaN NaN NaN NaN
GICS Sub Industry 340 104 Oil & Gas Exploration & Production 16 NaN NaN NaN NaN NaN NaN NaN
Current Price 340.0 NaN NaN NaN 80.862345 98.055086 4.5 38.555 59.705 92.880001 1274.949951
Price Change 340.0 NaN NaN NaN 4.078194 12.006338 -47.129693 -0.939484 4.819505 10.695493 55.051683
Volatility 340.0 NaN NaN NaN 1.525976 0.591798 0.733163 1.134878 1.385593 1.695549 4.580042
ROE 340.0 NaN NaN NaN 39.597059 96.547538 1.0 9.75 15.0 27.0 917.0
Cash Ratio 340.0 NaN NaN NaN 70.023529 90.421331 0.0 18.0 47.0 99.0 958.0
Net Cash Flow 340.0 NaN NaN NaN 55537620.588235 1946365312.175789 -11208000000.0 -193906500.0 2098000.0 169810750.0 20764000000.0
Net Income 340.0 NaN NaN NaN 1494384602.941176 3940150279.327936 -23528000000.0 352301250.0 707336000.0 1899000000.0 24442000000.0
Earnings Per Share 340.0 NaN NaN NaN 2.776662 6.587779 -61.2 1.5575 2.895 4.62 50.09
Estimated Shares Outstanding 340.0 NaN NaN NaN 577028337.75403 845849595.417695 27672156.86 158848216.1 309675137.8 573117457.325 6159292035.0
P/E Ratio 340.0 NaN NaN NaN 32.612563 44.348731 2.935451 15.044653 20.819876 31.764755 528.039074
P/B Ratio 340.0 NaN NaN NaN -1.718249 13.966912 -76.119077 -4.352056 -1.06717 3.917066 129.064585

Observation

  • There are 340 observations present in all

  • Differences between mean and median values indicate skewness in the data

  • There are 11 GICS Sectors of which most of the are of the industrial sector

  • There are 104 GICS Sub induustry

  • 50% of the current stock prices are below 59.705 dollars, however, the maximum current stock price is approximately 1275 dollars.
  • 75% of the variations or change in price within the selected companies is not below 10.69 dollars.
  • 75% of the volatility which is a measure of the standard deviation of the stock price for the past 13 weeks was found to be below 1.6956
  • 75% of the companies has its return on equity (ROE) lass than 27%
  • 50% of the companies has a favoured cash ratio of 47.0. This means that most of the companies has enough cash or cash equivalent to entirely pay off all their short term debts.
  • 25% of the companies has a negative net cash flow of -193906500.0 dollars. This means that the companies are paying more for obligations and liabilities than what it earns through operations.
  • 75% of the companies under study has a net income of 1899000000 dollars. This indicates that after all expenses in the companies including salaries, wages, cost of good or materials, and taxes. They seem to have enough to take home.
  • 75% of the companies has an Earning per share below 4.62 dollars. This means that majority of the companies if invested upon will not give a profit greater than 4.62 dollars for each share bought by an investor.

Checking for missing values

In [10]:
# Checking for missing values
df.isnull().sum()
Out[10]:
Ticker Symbol                   0
Security                        0
GICS Sector                     0
GICS Sub Industry               0
Current Price                   0
Price Change                    0
Volatility                      0
ROE                             0
Cash Ratio                      0
Net Cash Flow                   0
Net Income                      0
Earnings Per Share              0
Estimated Shares Outstanding    0
P/E Ratio                       0
P/B Ratio                       0
dtype: int64

Observation No null values present, therefore no missing values listed

Checking unique values

In [11]:
df.nunique()
Out[11]:
Ticker Symbol                   340
Security                        340
GICS Sector                      11
GICS Sub Industry               104
Current Price                   336
Price Change                    340
Volatility                      340
ROE                              82
Cash Ratio                      124
Net Cash Flow                   332
Net Income                      335
Earnings Per Share              268
Estimated Shares Outstanding    323
P/E Ratio                       295
P/B Ratio                       299
dtype: int64

Checking for duplicate values

In [12]:
# checking for duplicate values
df.duplicated().sum()
Out[12]:
0

Observation

  • There are no duplicate values in the dataset

Exploratory Data Analysis (EDA)

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What does the distribution of stock prices look like?
  2. The stocks of which economic sector have seen the maximum price increase on average?
  3. How are the different variables correlated with each other?
  4. Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?
  5. P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors?

Univariate analysis

In [13]:
from matplotlib import patches
import random
#creating a histogram and Boxplot using function

def histobox_plot(df, column, figsize=(15, 10), kde=False, bins=None):
    #plt.figure(figsize = (20,10))
    # set a grey background (use sns.set_theme() if seaborn version 0.11.0 or above)
    sns.set(style="darkgrid")
    # creating a figure composed of two matplotlib.Axes objects (ax_box and ax_hist)
    f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)},figsize=figsize,)
    # assigning a graph to each ax
    sns.boxplot(df, x=column, ax=ax_box,showmeans=True, color="violet")
    sns.histplot(data=df, x=column, ax=ax_hist)
    ax_hist.axvline(
        data[column].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist.axvline(
        data[column].median(), color="black", linestyle="-"
    )  # Add median to the histogram
    # Remove x axis name for the boxplot
    ax_box.set(xlabel='')
    for p in ax_hist.patches:
      height = p.get_height() # get the height of each bar
      # adding text to each bar
      ax_hist.text(x = p.get_x()+(p.get_width()/2), # x-coordinate position of data label, padded to be in the middle of the bar
      y = height+0.2, # y-coordinate position of data label, padded 0.2 above bar
      s = '{:.0f}'.format(height), # data label, formatted to ignore decimals
      ha = 'center') # sets horizontal alignment (ha) to center

Current Price

In [14]:
histobox_plot(df, 'Current Price')

Observation

  • The average current stock price of the companies under study is 80.86 dollars which is higher than the median 59.705 dollars indicating that the distribution is a right-skewed with 49 of the 340 stocks having twice the median value of all stocks
  • As expected, no stock is listed at less of less than 0 dollars
  • There are outliers

Price Change

In [15]:
histobox_plot(df, 'Price Change')

Observation

  • The average variation in price or price change of the companies under study is 4.078 dollars which is lesser than the median 4.819 dollars indicating that the distribution is a left-skewed
  • While there is a tendency for the distribution to favor lower volatilities, it is important to note the presence of extended tails in both positive and negative price changes. Among the most volatile stocks, price fluctuations span from a significant 47% decrease to a substantial 55% increase over a 13-week period.
  • There are outliers

Volatility

In [16]:
histobox_plot(df, 'Volatility')

Observation

  • The mean volatility of the companies under study is 1.526 which is higher than the median 1.386 dollars indicating that the distribution is a right-skewed
  • There are outliers

ROE

In [17]:
histobox_plot(df, 'ROE')

Observation

  • The mean Return on Equity (ROE) of the companies under study is 39.60% which is higher than the median 15.0% indicating that the distribution is a right-skewed.
  • There are no stocks listed with values below 0 for either metric. For instance, there are 24 stocks listed with returns on equity below 5, while 25 stocks are listed with returns exceeding 100 percent.
  • There are outliers

Cash Ratio

In [18]:
histobox_plot(df, 'Cash Ratio')

Observation

  • The mean cash ratio of the companies under study is 70.02 which is higher than the median 47 indicating that the distribution is a right-skewed
  • There are outliers

Net Cash Flow

In [19]:
histobox_plot(df, 'Net Cash Flow')

Observation

  • The mean Net Cash Flow of the companies under study is 55537620.59 dollars which is higher than the median 2098000 dollars indicating that the distribution is a right-skewed
  • There are outliers

Net Income

In [20]:
histobox_plot(df, 'Net Income')

Observation

  • The mean Net Income of the companies under study is 1494384602.94 dollars which is higher than the median 707336000 dollars indicating that the distribution is a right-skewed
  • There are outliers

Earnings Per Share

In [21]:
histobox_plot(df, 'Earnings Per Share')

Observation

  • The mean Earnings per share of the companies under study is 2.77 dollars which is lesser than the median 2.90 dollars indicating that the distribution is a left-skewed
  • There are outliers

Estimated Shares Outstanding

In [22]:
histobox_plot(df, 'Estimated Shares Outstanding')

Observation

  • The mean Estimated Shares Outstanding of the companies under study is 577028337.75 dollars which is higher than the median 309675137.8 dollars indicating that the distribution is a right-skewed
  • There are outliers

P/E Ratio

In [23]:
histobox_plot(df, 'P/E Ratio')

Observation

  • The mean P/E ratio of the companies under study is 32.6 which is greater than the median 20.82 dollars indicating that the distribution is a right-skewed
  • There are outliers

P/B Ratio

In [24]:
histobox_plot(df, 'P/B Ratio')

Observation

  • The P/B of the companies under study is -1.718 which is lesser than the median -1.067 indicating that the distribution is a left-skewed
  • The distribution for P/B ratios is mostly centered around 0 but with long positive and negative
    • For example, 175 of the 340 total stocks are shown to below the 25th percentile and above the 75th percentile and
    • Additionally, 31 of the stocks are outliers
In [25]:
# function to create labeled barplots


def labeled_barplot(df, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(df[feature])  # length of the column
    count = df[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=df,
        x=feature,
        palette="Paired",
        order=df[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

GICS Sector

In [26]:
labeled_barplot(df, 'GICS Sector', perc=True)

GICS Sub Industry

In [27]:
labeled_barplot(df, 'GICS Sub Industry', perc=True)

Bivariate Analysis

In [28]:
# #Using heatmap to check correlation between series
plt.figure(figsize=(15, 7))
sns.heatmap(
    df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()

Observation :Some variables exhibit moderate correlations (+/- .40) with one another.

  • Volatility demonstrates a negative correlation with price change, implying that an increase in a stock's volatility is likely accompanied by a decrease in its price.
  • Net income displays a negative correlation with volatility, suggesting that higher net income is associated with lower price volatility.
  • Net income also exhibits a positive correlation with earnings per share (EPS) and estimated shares outstanding.
  • EPS shows a positive correlation with the current price, indicating that an increase in a company's EPS is likely to lead to a corresponding rise in its stock price.
  • EPS is inversely correlated with return on equity (ROE), implying that as a company generates more equity for shareholders, the subsequent net income will yield a comparatively lower return.

The stocks of which economic sector have seen the maximum price increase on average?.

In [29]:
df.groupby('GICS Sector')['Price Change'].mean().sort_values()
Out[29]:
GICS Sector
Energy                        -10.228289
Utilities                       0.803657
Industrials                     2.833127
Financials                      3.865406
Materials                       5.589738
Consumer Discretionary          5.846093
Real Estate                     6.205548
Telecommunications Services     6.956980
Information Technology          7.217476
Consumer Staples                8.684750
Health Care                     9.585652
Name: Price Change, dtype: float64
In [30]:
plt.figure(figsize=(15,8))
ax=sns.barplot(data=df, x='GICS Sector', y='Price Change', ci=False)
for i in ax.containers:
    ax.bar_label(i,)
plt.xticks(rotation=90)
plt.show()

Observation In the preceding period, stocks within the healthcare sector have demonstrated the most notable average price escalation.

Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?.

In [31]:
df.groupby('GICS Sector')['Cash Ratio'].mean().sort_values(ascending=False)
Out[31]:
GICS Sector
Information Technology         149.818182
Telecommunications Services    117.000000
Health Care                    103.775000
Financials                      98.591837
Consumer Staples                70.947368
Energy                          51.133333
Real Estate                     50.111111
Consumer Discretionary          49.575000
Materials                       41.700000
Industrials                     36.188679
Utilities                       13.625000
Name: Cash Ratio, dtype: float64
In [32]:
plt.figure(figsize=(20,8))
ax=sns.barplot(data=df, x='GICS Sector', y='Cash Ratio', ci=False)
for i in ax.containers:
    ax.bar_label(i,)
plt.show()

Observation

  • The IT and Telecommunications sectors, characterized as relatively newer and less regulated industries, exhibit notably elevated average cash ratios compared to their peer sectors.
  • In contrast, the Utilities sector, which operates within a highly regulated environment, records the most modest average cash ratios among all sectors.

P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. Let's see how the P/E ratio varies, on average, across economic sectors.

In [33]:
df.groupby('GICS Sector')['P/E Ratio'].mean().sort_values(ascending=False)
Out[33]:
GICS Sector
Energy                         72.897709
Information Technology         43.782546
Real Estate                    43.065585
Health Care                    41.135272
Consumer Discretionary         35.211613
Consumer Staples               25.521195
Materials                      24.585352
Utilities                      18.719412
Industrials                    18.259380
Financials                     16.023151
Telecommunications Services    12.222578
Name: P/E Ratio, dtype: float64
In [34]:
plt.figure(figsize=(15,8))

ax=sns.barplot(data=df, x='GICS Sector', y='P/E Ratio', ci=False)
for i in ax.containers:
    ax.bar_label(i,)
plt.xticks(rotation=90)
plt.show()

Observation

  • Energy enterprises maintain a substantial lead in average P/E ratios across all sectors, while telecommunications companies feature the lowest average P/E ratios.

Volatility is responsible for the oscillations observed in stock prices. Stocks characterized by high volatility tend to experience more pronounced fluctuations, thereby increasing their investment risk. Now, let's examine the average variations in volatility within different economic sectors.

In [35]:
df.groupby('GICS Sector')['Volatility'].mean().sort_values(ascending=False)
Out[35]:
GICS Sector
Energy                         2.568777
Materials                      1.816726
Information Technology         1.659801
Consumer Discretionary         1.595478
Health Care                    1.541023
Industrials                    1.416989
Telecommunications Services    1.341612
Financials                     1.267255
Real Estate                    1.206053
Consumer Staples               1.152675
Utilities                      1.118018
Name: Volatility, dtype: float64
In [36]:
plt.figure(figsize=(15,8))

ax=sns.barplot(data=df, x='GICS Sector', y='Volatility', ci=False)
for i in ax.containers:
    ax.bar_label(i,)
plt.xticks(rotation=90)
plt.show()

Observation

  • Among the sectors, the energy sector exhibits the highest level of volatility, closely trailed by the material sector. In contrast, the utilities sector demonstrates the lowest level of volatility.

The Earnings Per Share (EPS) metric plays a significant role in influencing stock investments. EPS represents the portion of a company's profit allocated to each outstanding share of its common stock. It serves as a crucial indicator of a company's financial health and performance, impacting investor decisions Now, let's examine the average variations in Earnings Per Share within different economic sectors.

In [37]:
df.groupby('GICS Sector')['Earnings Per Share'].mean().sort_values(ascending=False)
Out[37]:
GICS Sector
Health Care                    4.541250
Consumer Discretionary         4.525750
Industrials                    4.456792
Financials                     4.220408
Telecommunications Services    3.550000
Consumer Staples               3.224211
Materials                      3.129000
Utilities                      2.753125
Real Estate                    2.340000
Information Technology         2.266061
Energy                        -6.908333
Name: Earnings Per Share, dtype: float64
In [38]:
plt.figure(figsize=(15,8))

ax=sns.barplot(data=df, x='GICS Sector', y='Earnings Per Share', ci=False)
for i in ax.containers:
    ax.bar_label(i,)

plt.xticks(rotation=90)
plt.show()

Observation

  • The health sector was found to have the highest earnings per share, an investor who invest in the health sector is likely to have a profit of 4.54 dollars or above per share in invested.

Data Preprocessing

  • Duplicate value check
  • Missing value treatment
  • Outlier check
  • Feature engineering (if needed)
  • Any other preprocessing steps (if needed)

Outlier Check

  • Plotting the boxplots of all numerical columns to check for outliers.
In [39]:
plt.figure(figsize=(15, 12))

numeric_columns = df.select_dtypes(include=np.number).columns.tolist()

for i, variable in enumerate(numeric_columns):
    plt.subplot(3, 4, i + 1)
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

Observation

  • Each of these variables contains a few or several outliers; nevertheless, these values do not seem implausible considering the inherent nature of stock prices and historical anticipations.

Scaling

  • Prior to engaging in the clustering process, it is imperative to preprocess and scale the data. This step ensures that each variable exerts a comparable influence on the clustering outcome, thus preventing any single variable from overshadowing others. By scaling the data, we create a balanced and equitable platform for the clustering analysis to unfold effectively.
In [40]:
num_col = ['Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio','Net Cash Flow','Net Income','Earnings Per Share','Estimated Shares Outstanding','P/E Ratio','P/B Ratio']
In [41]:
num_col
Out[41]:
['Current Price',
 'Price Change',
 'Volatility',
 'ROE',
 'Cash Ratio',
 'Net Cash Flow',
 'Net Income',
 'Earnings Per Share',
 'Estimated Shares Outstanding',
 'P/E Ratio',
 'P/B Ratio']
In [42]:
# Scaling the data set before clustering
scaler = StandardScaler()
subset = df[numeric_columns].copy()
subset_scaled = scaler.fit_transform(subset)
In [43]:
# Creating a dataframe from the scaled data
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)
In [44]:
#create pairplot for scaled dataframe
sns.pairplot(subset_scaled_df,diag_kind='kde')
Out[44]:
<seaborn.axisgrid.PairGrid at 0x7a9237a4fbb0>
In [45]:
subset_scaled_df.head()
Out[45]:
Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
0 -0.393341 0.493950 0.272749 0.989601 -0.210698 -0.339355 1.554415 1.309399 0.107863 -0.652487 -0.506653
1 -0.220837 0.355439 1.137045 0.937737 0.077269 -0.002335 0.927628 0.056755 1.250274 -0.311769 -0.504205
2 -0.367195 0.602479 -0.427007 -0.192905 -0.033488 0.454058 0.744371 0.024831 1.098021 -0.391502 0.094941
3 0.133567 0.825696 -0.284802 -0.317379 1.218059 -0.152497 -0.219816 -0.230563 -0.091622 0.947148 0.424333
4 -0.260874 -0.492636 0.296470 -0.265515 2.237018 0.133564 -0.202703 -0.374982 1.978399 3.293307 0.199196

EDA

  • It is a good idea to explore the data once again after manipulating it.
In [45]:

K-means Clustering

Checking Elbow Plot

In [46]:
k_means_df = subset_scaled_df.copy()
In [47]:
#print average distortions for range of kmeans models fitted to scaled dataset
clusters = range(1, 15)
meanDistortions = []

for k in clusters:
    model = KMeans(n_clusters=k, random_state=1)
    model.fit(subset_scaled_df)
    prediction = model.predict(k_means_df)
    distortion = (
        sum(np.min(cdist(k_means_df, model.cluster_centers_, "euclidean"), axis=1))
        / k_means_df.shape[0]
    )

    meanDistortions.append(distortion)

    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)

plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
plt.show()
Number of Clusters: 1 	Average Distortion: 2.5425069919221697
Number of Clusters: 2 	Average Distortion: 2.382318498894466
Number of Clusters: 3 	Average Distortion: 2.2692367155390745
Number of Clusters: 4 	Average Distortion: 2.1745559827866363
Number of Clusters: 5 	Average Distortion: 2.128799332840716
Number of Clusters: 6 	Average Distortion: 2.080400099226289
Number of Clusters: 7 	Average Distortion: 2.0289794220177395
Number of Clusters: 8 	Average Distortion: 1.964144163389972
Number of Clusters: 9 	Average Distortion: 1.9221492045198068
Number of Clusters: 10 	Average Distortion: 1.8513913649973124
Number of Clusters: 11 	Average Distortion: 1.8024134734578485
Number of Clusters: 12 	Average Distortion: 1.7900931879652673
Number of Clusters: 13 	Average Distortion: 1.7417609203336912
Number of Clusters: 14 	Average Distortion: 1.673559857259703
In [48]:
#fit KMeans model and use visualizaer to indicate optimal K value
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(1, 15), timings=True)
visualizer.fit(k_means_df)
visualizer.show()
Out[48]:
<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>

Checking the silhouette scores

In [49]:
sil_score = []
cluster_list =  range(2, 15)
for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters, random_state=1)
    preds = clusterer.fit_predict((subset_scaled_df))
    score = silhouette_score(k_means_df, preds)
    sil_score.append(score)
    print("For n_clusters = {}, the silhouette score is {})".format(n_clusters, score))

plt.plot(cluster_list, sil_score)
plt.show()
For n_clusters = 2, the silhouette score is 0.43969639509980457)
For n_clusters = 3, the silhouette score is 0.4644405674779404)
For n_clusters = 4, the silhouette score is 0.4577225970476733)
For n_clusters = 5, the silhouette score is 0.43228336443659804)
For n_clusters = 6, the silhouette score is 0.4005422737213617)
For n_clusters = 7, the silhouette score is 0.3976335364987305)
For n_clusters = 8, the silhouette score is 0.40278401969450467)
For n_clusters = 9, the silhouette score is 0.3778585981433699)
For n_clusters = 10, the silhouette score is 0.13458938329968687)
For n_clusters = 11, the silhouette score is 0.1421832155528444)
For n_clusters = 12, the silhouette score is 0.2044669621527429)
For n_clusters = 13, the silhouette score is 0.23424874810104204)
For n_clusters = 14, the silhouette score is 0.12102526472829901)
In [50]:
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(2, 15), metric="silhouette", timings=True)
visualizer.fit(k_means_df)
visualizer.show()
Out[50]:
<Axes: title={'center': 'Silhouette Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='silhouette score'>
In [51]:
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(14, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
Out[51]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 14 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [52]:
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(13, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
Out[52]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 13 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [53]:
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(12, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
Out[53]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 12 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [54]:
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(11, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
Out[54]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 11 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [55]:
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(10, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
Out[55]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 10 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [56]:
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(9, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
Out[56]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 9 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [57]:
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(8, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
Out[57]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 8 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [58]:
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(7, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
Out[58]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 7 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [59]:
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(6, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
Out[59]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 6 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [60]:
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(5, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
Out[60]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 5 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [61]:
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
Out[61]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [62]:
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(3, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
Out[62]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 3 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [63]:
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(2, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
Out[63]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 2 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>

Observation

  • TThe K value was chosen based on Elbow method and the distortion which has 6 as its optimal K. Therefore 6 was choosen as the optimal K value for the K-Mean clustering.
  • The Silhouette could not give a good optimal K due to the fact that no K could cut through all the clusters and however k greater than 10 values seems to cut through all but that gives a lot of clustering.
In [145]:
# final K-means model
kmeans = KMeans(n_clusters=6, random_state=1)
kmeans.fit(k_means_df)
Out[145]:
KMeans(n_clusters=6, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
In [146]:
# creating a copy of the original data
df1 = df.copy()
In [147]:
# adding kmeans cluster labels to the original and scaled dataframes
k_means_df["KM_segments"] = kmeans.labels_
df1["KM_segments"] = kmeans.labels_

Cluster Profiling

In [148]:
km_cluster_profile = df1.groupby("KM_segments").mean()
In [149]:
km_cluster_profile["count_in_each_segment"] = (
    df1.groupby("KM_segments")["Security"].count().values
)
In [150]:
km_cluster_profile.style.highlight_max(color="lightgreen", axis=0)
Out[150]:
  Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio count_in_each_segment
KM_segments                        
0 73.584688 5.178948 1.382351 34.836957 53.427536 -9571028.985507 1496099391.304348 3.678714 437210392.354746 23.667285 -3.382363 276
1 632.714991 7.374164 1.541343 19.333333 158.333333 -24046333.333333 907393166.666667 16.270000 125797901.323333 123.049240 35.355736 6
2 94.097709 16.583978 1.778223 29.375000 336.937500 2405463875.000000 1842690937.500000 2.321250 688296132.786250 41.052507 10.054741 16
3 26.990000 -14.060688 3.296307 603.000000 57.333333 -585000000.000000 -17555666666.666668 -39.726667 481910081.666667 71.528835 1.638633 3
4 50.517273 5.747586 1.130399 31.090909 75.909091 -1072272727.272727 14833090909.090910 4.154545 4298826628.727273 14.803577 -4.552119 11
5 44.475357 -13.336921 2.760005 39.678571 61.214286 -116741035.714286 -1794895750.000000 -4.733571 536400637.605000 99.412120 0.766948 28
In [151]:
## Complete the code to print the companies in each cluster
for cl in df1["KM_segments"].unique():
    print("In cluster {}, the following companies are present:".format(cl))
    print(df1[df1["KM_segments"] == cl]["Security"].unique())
    print()
In cluster 0, the following companies are present:
['American Airlines Group' 'AbbVie' 'Abbott Laboratories'
 'Adobe Systems Inc' 'Archer-Daniels-Midland Co' 'Ameren Corp'
 'American Electric Power' 'AFLAC Inc'
 'American International Group, Inc.' 'Apartment Investment & Mgmt'
 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc'
 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp' 'Allegion'
 'Applied Materials Inc' 'AMETEK Inc' 'Affiliated Managers Group Inc'
 'Ameriprise Financial' 'American Tower Corp A' 'AutoNation Inc'
 'Anthem Inc.' 'Aon plc' 'Amphenol Corp' 'Arconic Inc'
 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom'
 'American Water Works Company Inc' 'American Express Co' 'Boeing Company'
 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.'
 'BIOGEN IDEC Inc.' 'The Bank of New York Mellon Corp.' 'Ball Corp'
 'Bristol-Myers Squibb' 'Boston Scientific' 'BorgWarner'
 'Boston Properties' 'Caterpillar Inc.' 'Chubb Limited' 'CBRE Group'
 'Crown Castle International Corp.' 'Carnival Corp.'
 'CF Industries Holdings Inc' 'Citizens Financial Group' 'Church & Dwight'
 'C. H. Robinson Worldwide' 'Charter Communications' 'CIGNA Corp.'
 'Cincinnati Financial' 'Colgate-Palmolive' 'Comerica Inc.'
 'CME Group Inc.' 'Cummins Inc.' 'CMS Energy' 'Centene Corporation'
 'CenterPoint Energy' 'Capital One Financial' 'The Cooper Companies'
 'CSX Corp.' 'CenturyLink Inc' 'Cognizant Technology Solutions'
 'Citrix Systems' 'CVS Health' 'Chevron Corp.' 'Dominion Resources'
 'Delta Air Lines' 'Du Pont (E.I.)' 'Deere & Co.'
 'Discover Financial Services' 'Quest Diagnostics' 'Danaher Corp.'
 'The Walt Disney Company' 'Discovery Communications-A'
 'Discovery Communications-C' 'Delphi Automotive' 'Digital Realty Trust'
 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy'
 'DaVita Inc.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison'
 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'Equity Residential'
 'Eversource Energy' 'Essex Property Trust, Inc.' 'E*Trade'
 'Eaton Corporation' 'Entergy Corp.' 'Exelon Corp.' "Expeditors Int'l"
 'Expedia Inc.' 'Extra Space Storage' 'Fastenal Co'
 'Fortune Brands Home & Security' 'FirstEnergy Corp'
 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems'
 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation'
 'Federal Realty Investment Trust' 'General Dynamics'
 'General Growth Properties Inc.' 'Corning Inc.' 'General Motors'
 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber'
 'Grainger (W.W.) Inc.' 'Hasbro Inc.' 'Huntington Bancshares'
 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.' 'Hartford Financial Svc.Gp.'
 'Harley-Davidson' "Honeywell Int'l Inc." 'HP Inc.' 'Hormel Foods Corp.'
 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company'
 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories'
 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group'
 'Iron Mountain Incorporated' 'Illinois Tool Works' 'Invesco Ltd.'
 'J. B. Hunt Transport Services' 'Jacobs Engineering Group'
 'Juniper Networks' 'Kimco Realty' 'Kimberly-Clark' 'Kansas City Southern'
 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding'
 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.'
 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.'
 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell'
 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l."
 'Masco Corp.' 'Mattel Inc.' "Moody's Corp" 'Mondelez International'
 'MetLife Inc.' 'Mohawk Industries' 'Mead Johnson' 'McCormick & Co.'
 'Martin Marietta Materials' 'Marsh & McLennan' '3M Company'
 'Altria Group Inc' 'Marathon Petroleum' 'Merck & Co.' 'M&T Bank Corp.'
 'Mettler Toledo' 'Mylan N.V.' 'Navient' 'NASDAQ OMX Group'
 'NextEra Energy' 'Newmont Mining Corp. (Hldg. Co.)' 'Nielsen Holdings'
 'Norfolk Southern Corp.' 'Northern Trust Corp.' 'Nucor Corp.'
 'Newell Brands' 'Realty Income Corporation' 'Omnicom Group'
 "O'Reilly Automotive" "People's United Financial" 'Pitney-Bowes'
 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.'
 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.'
 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services'
 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.'
 'Prudential Financial' 'Phillips 66' 'Praxair Inc.' 'PayPal'
 'Ryder System' 'Royal Caribbean Cruises Ltd' 'Robert Half International'
 'Roper Industries' 'Republic Services Inc' 'SCANA Corp'
 'Charles Schwab Corporation' 'Sealed Air' 'Sherwin-Williams'
 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.'
 'Simon Property Group Inc' 'S&P Global, Inc.' 'Stericycle Inc'
 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.'
 'Skyworks Solutions' 'Synchrony Financial' 'Stryker Corp.'
 'Molson Coors Brewing Company' 'Tegna, Inc.' 'Torchmark Corp.'
 'Thermo Fisher Scientific' 'The Travelers Companies Inc.'
 'Tractor Supply Company' 'Tyson Foods' 'Tesoro Petroleum Co.'
 'Total System Services' 'Texas Instruments' 'Under Armour'
 'United Continental Holdings' 'UDR Inc' 'Universal Health Services, Inc.'
 'United Health Group Inc.' 'Unum Group' 'Union Pacific'
 'United Parcel Service' 'United Technologies' 'Varian Medical Systems'
 'Valero Energy' 'Vulcan Materials' 'Vornado Realty Trust'
 'Verisk Analytics' 'Verisign Inc.' 'Ventas Inc' 'Wec Energy Group Inc'
 'Whirlpool Corp.' 'Waste Management Inc.' 'Western Union Co'
 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Xcel Energy Inc' 'XL Capital'
 'Dentsply Sirona' 'Xerox Corp.' 'Xylem Inc.' 'Yum! Brands Inc'
 'Zimmer Biomet Holdings' 'Zions Bancorp' 'Zoetis']

In cluster 5, the following companies are present:
['Analog Devices, Inc.' 'Alexion Pharmaceuticals'
 'Anadarko Petroleum Corp' 'Baker Hughes Inc' 'Cabot Oil & Gas'
 'Concho Resources' 'EOG Resources' 'EQT Corporation'
 'Freeport-McMoran Cp & Gld' 'Hess Corporation'
 'Hewlett Packard Enterprise' 'Kinder Morgan' 'The Mosaic Company'
 'Marathon Oil Corp.' 'Murphy Oil' 'Noble Energy Inc' 'Netflix Inc.'
 'Newfield Exploration Co' 'National Oilwell Varco Inc.' 'ONEOK'
 'Occidental Petroleum' 'Quanta Services Inc.' 'Range Resources Corp.'
 'Spectra Energy Corp.' 'Southwestern Energy' 'Teradata Corp.'
 'Williams Cos.' 'Cimarex Energy']

In cluster 1, the following companies are present:
['Alliance Data Systems' 'Amazon.com Inc' 'Chipotle Mexican Grill'
 'Intuitive Surgical Inc.' 'Priceline.com Inc' 'Regeneron']

In cluster 2, the following companies are present:
['Amgen Inc' 'Bank of America Corp' 'Celgene Corp.' 'Equinix'
 'Edwards Lifesciences' 'Facebook' 'First Solar Inc'
 'Frontier Communications' 'Halliburton Co.' "McDonald's Corp."
 'Monster Beverage' 'TripAdvisor' 'Vertex Pharmaceuticals Inc'
 'Waters Corporation' 'Wynn Resorts Ltd' 'Yahoo Inc.']

In cluster 3, the following companies are present:
['Apache Corporation' 'Chesapeake Energy' 'Devon Energy Corp.']

In cluster 4, the following companies are present:
['Citigroup Inc.' 'Ford Motor' 'Gilead Sciences' 'Intel Corp.'
 'JPMorgan Chase & Co.' 'Coca Cola Company' 'Pfizer Inc.' 'AT&T Inc'
 'Verizon Communications' 'Wells Fargo' 'Exxon Mobil Corp.']

In [152]:
#print number of stocks within each sector for all of the clusters
df1.groupby(["KM_segments", "GICS Sector"])['Security'].count()
Out[152]:
KM_segments  GICS Sector                
0            Consumer Discretionary         33
             Consumer Staples               17
             Energy                          5
             Financials                     45
             Health Care                    30
             Industrials                    52
             Information Technology         24
             Materials                      18
             Real Estate                    26
             Telecommunications Services     2
             Utilities                      24
1            Consumer Discretionary          3
             Health Care                     2
             Information Technology          1
2            Consumer Discretionary          3
             Consumer Staples                1
             Energy                          1
             Financials                      1
             Health Care                     5
             Information Technology          3
             Real Estate                     1
             Telecommunications Services     1
3            Energy                          3
4            Consumer Discretionary          1
             Consumer Staples                1
             Energy                          1
             Financials                      3
             Health Care                     2
             Information Technology          1
             Telecommunications Services     2
5            Energy                         20
             Health Care                     1
             Industrials                     1
             Information Technology          4
             Materials                       2
Name: Security, dtype: int64
In [153]:
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")

# selecting numerical columns
num_col = df.select_dtypes(include=np.number).columns.tolist()

for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(data=df1, x="KM_segments", y=variable)

plt.tight_layout(pad=2.0)
In [ ]:

KMeans Clusters

Cluster 0

  • 276 which comprised sectors from Consumer Discretionary,Consumer Staples,Energy,Financials,Health Care,Industrials,Information Technology,Materials,Real Estate,Telecommunications Services, and Utilities
  • Companies within this cluster have:
    • Low Volatility
    • Low net income
    • low cashflow
    • low cash ratio
    • low P/E Ratio

Cluster 1

  • 3 stocks, comprised mostly of stocks within the Consumer Discretionary, Health Care, and Information Technology
  • Companies within this cluster have:
  • low Volatility
  • Highest current Price
  • High price change
  • lowest Return on Equity
  • Low Net income
  • low Cash Flow

Cluster 2

  • 16 stocks from companies like Consumer Discretionary,Consumer Staples, Energy,Financials,Health Care,Information Technology,Real Estate,Telecommunications Services.
  • Companies within this cluster have:

  • Moderate Volatility

  • Highest current Price
  • Low price change
  • low Return on Equity
  • Low Net income
  • low Cash Flow

Cluster 3

  • 3 stocks, comprised of stocks from the Energy sectors
  • Companies within this cluster have:
    • Highest Volatility
  • low current Price
  • Low price change
  • Highest Return on Equity
  • Lowest Net income
  • low Cash Flow

Cluster 4

  • 11 stocks, comprised of stocks from the Consumer Discretionary, Consumer Staples, Energy, Financials, Health Care, Information Technology, and Telecommunications Services sectors.
  • Companies within this cluster have:
  • low Volatility
  • low current Price
  • Moderate price change
  • Low Return on Equity
  • Highest Net income
  • low and high Cash Flow

Cluster 5

  • 28 stocks, comprised of stocks from the Energy, Health Care, Industrials, Information Technology, and Materials sectors.
  • Companies within this cluster have:
  • High Volatility
  • low current Price
  • Low and negative price change
  • Low Return on Equity
  • low Net income
  • low Cash Flow

Hierarchical Clustering

Computing Cophenetic Correlation

In [73]:
hc_df = subset_scaled_df.copy()
In [74]:
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]

# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(hc_df, metric=dm, method=lm)
        c, coph_dists = cophenet(Z, pdist(hc_df))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm
Cophenetic correlation for Euclidean distance and single linkage is 0.9232271494002922.
Cophenetic correlation for Euclidean distance and complete linkage is 0.7873280186580672.
Cophenetic correlation for Euclidean distance and average linkage is 0.9422540609560814.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.8693784298129404.
Cophenetic correlation for Chebyshev distance and single linkage is 0.9062538164750717.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.598891419111242.
Cophenetic correlation for Chebyshev distance and average linkage is 0.9338265528030499.
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.9127355892367.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.925919553052459.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.7925307202850002.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.9247324030159736.
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.8708317490180428.
Cophenetic correlation for Cityblock distance and single linkage is 0.9334186366528574.
Cophenetic correlation for Cityblock distance and complete linkage is 0.7375328863205818.
Cophenetic correlation for Cityblock distance and average linkage is 0.9302145048594667.
Cophenetic correlation for Cityblock distance and weighted linkage is 0.731045513520281.
In [75]:
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('*'*100)
print(
    "Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)
****************************************************************************************************
Highest cophenetic correlation is 0.9422540609560814, which is obtained with Euclidean distance and average linkage.

Let's explore different linkage methods with Euclidean distance only.

In [76]:
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]

for lm in linkage_methods:
    Z = linkage(hc_df, metric="euclidean", method=lm)
    c, coph_dists = cophenet(Z, pdist(hc_df))
    print("Cophenetic correlation for {} linkage is {}.".format(lm, c))
    if high_cophenet_corr < c:
        high_cophenet_corr = c
        high_dm_lm[0] = "euclidean"
        high_dm_lm[1] = lm

# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('*'*100)
print(
    "Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(
        high_cophenet_corr, high_dm_lm[1]
    )
)
Cophenetic correlation for single linkage is 0.9232271494002922.
Cophenetic correlation for complete linkage is 0.7873280186580672.
Cophenetic correlation for average linkage is 0.9422540609560814.
Cophenetic correlation for centroid linkage is 0.9314012446828154.
Cophenetic correlation for ward linkage is 0.7101180299865353.
Cophenetic correlation for weighted linkage is 0.8693784298129404.
****************************************************************************************************
Highest cophenetic correlation is 0.9422540609560814, which is obtained with average linkage.

Let's view the dendrograms for the different linkage methods with Euclidean distance.

Checking Dendrograms

In [77]:
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = []

# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(hc_df, metric="euclidean", method=method)

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")

    coph_corr, coph_dist = cophenet(Z, pdist(hc_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

    compare.append([method, coph_corr])

Observation

  • The highest cophenetic correlation is observed for the average and centroid linkage methods, yet the dendrogram associated with the average linkage method seems to yield more coherent clusters.
  • The average linkage method suggests that the optimal number of clusters is 5.
In [78]:
# create and print a dataframe to compare cophenetic correlations for different linkage methods
df_cc = pd.DataFrame(compare, columns=compare_cols)
df_cc = df_cc.sort_values(by="Cophenetic Coefficient")
df_cc
Out[78]:
Linkage Cophenetic Coefficient
4 ward 0.710118
1 complete 0.787328
5 weighted 0.869378
0 single 0.923227
3 centroid 0.931401
2 average 0.942254

Creating model using sklearn

In [119]:
HCmodel = AgglomerativeClustering(n_clusters=5, affinity="euclidean", linkage="average")
HCmodel.fit(hc_df)
Out[119]:
AgglomerativeClustering(affinity='euclidean', linkage='average', n_clusters=5)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
In [120]:
# creating a copy of the original data
df2 = df.copy()

# adding hierarchical cluster labels to the original and scaled dataframes
hc_df["HC_segments"] = HCmodel.labels_
df2["HC_segments"] = HCmodel.labels_

Cluster Profiling

In [121]:
hc_cluster_profile = df2.groupby("HC_segments").mean()
In [122]:
hc_cluster_profile["count_in_each_segment"] = (
    df2.groupby("HC_segments")["Security"].count().values
)
In [123]:
hc_cluster_profile.style.highlight_max(color="lightgreen", axis=0)
Out[123]:
  Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio count_in_each_segment
HC_segments                        
0 76.979246 4.142346 1.516898 35.200000 67.280597 67033707.462687 1610408549.253731 2.887716 573826376.546119 32.329515 -2.152930 335
1 1274.949951 3.190527 1.268340 29.000000 184.000000 -1671386000.000000 2551360000.000000 50.090000 50935516.070000 25.453183 -1.052429 1
2 24.485001 -13.351992 3.482611 802.000000 51.000000 -1292500000.000000 -19106500000.000000 -41.815000 519573983.250000 60.748608 1.565141 2
3 104.660004 16.224320 1.320606 8.000000 958.000000 592000000.000000 3669000000.000000 1.310000 2800763359.000000 79.893133 5.884467 1
4 276.570007 6.189286 1.116976 30.000000 25.000000 90885000.000000 596541000.000000 8.910000 66951851.850000 31.040405 129.064585 1
In [124]:
## Complete the code to print the companies in each cluster
for cl in df2["HC_segments"].unique():
    print("In cluster {}, the following companies are present:".format(cl))
    print(df2[df2["HC_segments"] == cl]["Security"].unique())
    print()
In cluster 0, the following companies are present:
['American Airlines Group' 'AbbVie' 'Abbott Laboratories'
 'Adobe Systems Inc' 'Analog Devices, Inc.' 'Archer-Daniels-Midland Co'
 'Ameren Corp' 'American Electric Power' 'AFLAC Inc'
 'American International Group, Inc.' 'Apartment Investment & Mgmt'
 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc'
 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp' 'Allegion'
 'Alexion Pharmaceuticals' 'Applied Materials Inc' 'AMETEK Inc'
 'Affiliated Managers Group Inc' 'Amgen Inc' 'Ameriprise Financial'
 'American Tower Corp A' 'Amazon.com Inc' 'AutoNation Inc' 'Anthem Inc.'
 'Aon plc' 'Anadarko Petroleum Corp' 'Amphenol Corp' 'Arconic Inc'
 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom'
 'American Water Works Company Inc' 'American Express Co' 'Boeing Company'
 'Bank of America Corp' 'Baxter International Inc.' 'BB&T Corporation'
 'Bard (C.R.) Inc.' 'Baker Hughes Inc' 'BIOGEN IDEC Inc.'
 'The Bank of New York Mellon Corp.' 'Ball Corp' 'Bristol-Myers Squibb'
 'Boston Scientific' 'BorgWarner' 'Boston Properties' 'Citigroup Inc.'
 'Caterpillar Inc.' 'Chubb Limited' 'CBRE Group'
 'Crown Castle International Corp.' 'Carnival Corp.' 'Celgene Corp.'
 'CF Industries Holdings Inc' 'Citizens Financial Group' 'Church & Dwight'
 'C. H. Robinson Worldwide' 'Charter Communications' 'CIGNA Corp.'
 'Cincinnati Financial' 'Colgate-Palmolive' 'Comerica Inc.'
 'CME Group Inc.' 'Chipotle Mexican Grill' 'Cummins Inc.' 'CMS Energy'
 'Centene Corporation' 'CenterPoint Energy' 'Capital One Financial'
 'Cabot Oil & Gas' 'The Cooper Companies' 'CSX Corp.' 'CenturyLink Inc'
 'Cognizant Technology Solutions' 'Citrix Systems' 'CVS Health'
 'Chevron Corp.' 'Concho Resources' 'Dominion Resources' 'Delta Air Lines'
 'Du Pont (E.I.)' 'Deere & Co.' 'Discover Financial Services'
 'Quest Diagnostics' 'Danaher Corp.' 'The Walt Disney Company'
 'Discovery Communications-A' 'Discovery Communications-C'
 'Delphi Automotive' 'Digital Realty Trust' 'Dun & Bradstreet'
 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy' 'DaVita Inc.'
 'Devon Energy Corp.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison'
 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'EOG Resources'
 'Equinix' 'Equity Residential' 'EQT Corporation' 'Eversource Energy'
 'Essex Property Trust, Inc.' 'E*Trade' 'Eaton Corporation'
 'Entergy Corp.' 'Edwards Lifesciences' 'Exelon Corp.' "Expeditors Int'l"
 'Expedia Inc.' 'Extra Space Storage' 'Ford Motor' 'Fastenal Co'
 'Fortune Brands Home & Security' 'Freeport-McMoran Cp & Gld'
 'FirstEnergy Corp' 'Fidelity National Information Services' 'Fiserv Inc'
 'FLIR Systems' 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation'
 'Federal Realty Investment Trust' 'First Solar Inc'
 'Frontier Communications' 'General Dynamics'
 'General Growth Properties Inc.' 'Gilead Sciences' 'Corning Inc.'
 'General Motors' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber'
 'Grainger (W.W.) Inc.' 'Halliburton Co.' 'Hasbro Inc.'
 'Huntington Bancshares' 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.'
 'Hess Corporation' 'Hartford Financial Svc.Gp.' 'Harley-Davidson'
 "Honeywell Int'l Inc." 'Hewlett Packard Enterprise' 'HP Inc.'
 'Hormel Foods Corp.' 'Henry Schein' 'Host Hotels & Resorts'
 'The Hershey Company' 'Humana Inc.' 'International Business Machines'
 'IDEXX Laboratories' 'Intl Flavors & Fragrances' 'Intel Corp.'
 'International Paper' 'Interpublic Group' 'Iron Mountain Incorporated'
 'Intuitive Surgical Inc.' 'Illinois Tool Works' 'Invesco Ltd.'
 'J. B. Hunt Transport Services' 'Jacobs Engineering Group'
 'Juniper Networks' 'JPMorgan Chase & Co.' 'Kimco Realty' 'Kimberly-Clark'
 'Kinder Morgan' 'Coca Cola Company' 'Kansas City Southern'
 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding'
 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.'
 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.'
 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell'
 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l."
 'Masco Corp.' 'Mattel Inc.' "McDonald's Corp." "Moody's Corp"
 'Mondelez International' 'MetLife Inc.' 'Mohawk Industries'
 'Mead Johnson' 'McCormick & Co.' 'Martin Marietta Materials'
 'Marsh & McLennan' '3M Company' 'Monster Beverage' 'Altria Group Inc'
 'The Mosaic Company' 'Marathon Petroleum' 'Merck & Co.'
 'Marathon Oil Corp.' 'M&T Bank Corp.' 'Mettler Toledo' 'Murphy Oil'
 'Mylan N.V.' 'Navient' 'Noble Energy Inc' 'NASDAQ OMX Group'
 'NextEra Energy' 'Newmont Mining Corp. (Hldg. Co.)' 'Netflix Inc.'
 'Newfield Exploration Co' 'Nielsen Holdings'
 'National Oilwell Varco Inc.' 'Norfolk Southern Corp.'
 'Northern Trust Corp.' 'Nucor Corp.' 'Newell Brands'
 'Realty Income Corporation' 'ONEOK' 'Omnicom Group' "O'Reilly Automotive"
 'Occidental Petroleum' "People's United Financial" 'Pitney-Bowes'
 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.'
 'Pfizer Inc.' 'Principal Financial Group' 'Procter & Gamble'
 'Progressive Corp.' 'Pulte Homes Inc.' 'Philip Morris International'
 'PNC Financial Services' 'Pentair Ltd.' 'Pinnacle West Capital'
 'PPG Industries' 'PPL Corp.' 'Prudential Financial' 'Phillips 66'
 'Quanta Services Inc.' 'Praxair Inc.' 'PayPal' 'Ryder System'
 'Royal Caribbean Cruises Ltd' 'Regeneron' 'Robert Half International'
 'Roper Industries' 'Range Resources Corp.' 'Republic Services Inc'
 'SCANA Corp' 'Charles Schwab Corporation' 'Spectra Energy Corp.'
 'Sealed Air' 'Sherwin-Williams' 'SL Green Realty'
 'Scripps Networks Interactive Inc.' 'Southern Co.'
 'Simon Property Group Inc' 'S&P Global, Inc.' 'Stericycle Inc'
 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.'
 'Skyworks Solutions' 'Southwestern Energy' 'Synchrony Financial'
 'Stryker Corp.' 'AT&T Inc' 'Molson Coors Brewing Company'
 'Teradata Corp.' 'Tegna, Inc.' 'Torchmark Corp.'
 'Thermo Fisher Scientific' 'TripAdvisor' 'The Travelers Companies Inc.'
 'Tractor Supply Company' 'Tyson Foods' 'Tesoro Petroleum Co.'
 'Total System Services' 'Texas Instruments' 'Under Armour'
 'United Continental Holdings' 'UDR Inc' 'Universal Health Services, Inc.'
 'United Health Group Inc.' 'Unum Group' 'Union Pacific'
 'United Parcel Service' 'United Technologies' 'Varian Medical Systems'
 'Valero Energy' 'Vulcan Materials' 'Vornado Realty Trust'
 'Verisk Analytics' 'Verisign Inc.' 'Vertex Pharmaceuticals Inc'
 'Ventas Inc' 'Verizon Communications' 'Waters Corporation'
 'Wec Energy Group Inc' 'Wells Fargo' 'Whirlpool Corp.'
 'Waste Management Inc.' 'Williams Cos.' 'Western Union Co'
 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Wynn Resorts Ltd'
 'Cimarex Energy' 'Xcel Energy Inc' 'XL Capital' 'Exxon Mobil Corp.'
 'Dentsply Sirona' 'Xerox Corp.' 'Xylem Inc.' 'Yahoo Inc.'
 'Yum! Brands Inc' 'Zimmer Biomet Holdings' 'Zions Bancorp' 'Zoetis']

In cluster 4, the following companies are present:
['Alliance Data Systems']

In cluster 2, the following companies are present:
['Apache Corporation' 'Chesapeake Energy']

In cluster 3, the following companies are present:
['Facebook']

In cluster 1, the following companies are present:
['Priceline.com Inc']

In [125]:
df2.groupby(["HC_segments", "GICS Sector"])['Security'].count()
Out[125]:
HC_segments  GICS Sector                
0            Consumer Discretionary         39
             Consumer Staples               19
             Energy                         28
             Financials                     49
             Health Care                    40
             Industrials                    53
             Information Technology         31
             Materials                      20
             Real Estate                    27
             Telecommunications Services     5
             Utilities                      24
1            Consumer Discretionary          1
2            Energy                          2
3            Information Technology          1
4            Information Technology          1
Name: Security, dtype: int64
In [126]:
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")

for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(data=df2, x="HC_segments", y=variable)

plt.tight_layout(pad=2.0)

Let us try using Ward linkage as it has more distinct and separated clusters (as seen from it's dendrogram before). 6 appears to be the appropriate number of clusters from the dendrogram for Ward linkage.

Creating Final model using sklearn

In [137]:
HCmodel = AgglomerativeClustering(n_clusters=6, affinity="euclidean", linkage="ward")
HCmodel.fit(subset_scaled_df)
Out[137]:
AgglomerativeClustering(affinity='euclidean', n_clusters=6)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
In [138]:
# creating a copy of the original data
df2 = df.copy()

# adding hierarchical cluster labels to the original and scaled dataframes
hc_df["HC_segments"] = HCmodel.labels_
df2["HC_segments"] = HCmodel.labels_

Cluster Profiling

In [139]:
hc_cluster_profile = df2.groupby("HC_segments").mean()
In [140]:
hc_cluster_profile["count_in_each_segment"] = (
    df2.groupby("HC_segments")["Security"].count().values
)
In [141]:
hc_cluster_profile.style.highlight_max(color="lightgreen", axis=0)
Out[141]:
  Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio count_in_each_segment
HC_segments                        
0 42.848182 6.270446 1.123547 22.727273 71.454545 558636363.636364 14631272727.272728 3.410000 4242572567.290909 15.242169 -4.924615 11
1 84.355716 3.854981 1.827670 633.571429 33.571429 -568400000.000000 -4968157142.857142 -10.841429 398169036.442857 42.284541 -11.589502 7
2 325.996105 7.724708 1.545762 17.000000 360.333333 186503166.666667 1008736416.666667 9.544167 589434517.801667 41.121871 25.258283 12
3 72.760400 5.213307 1.427078 25.603509 60.392982 79951512.280702 1538594322.807018 3.655351 446472132.228456 24.722670 -2.647194 285
4 36.440455 -16.073408 2.832884 57.500000 42.409091 -472834090.909091 -3161045227.272727 -8.005000 514367806.201818 85.555682 0.836839 22
5 327.006671 21.917380 2.029752 4.000000 106.000000 698240666.666667 287547000.000000 0.750000 366763235.300000 400.989188 -5.322376 3
In [142]:
## Complete the code to print the companies in each cluster
for cl in df2["HC_segments"].unique():
    print("In cluster {}, the following companies are present:".format(cl))
    print(df2[df2["HC_segments"] == cl]["Security"].unique())
    print()
In cluster 3, the following companies are present:
['American Airlines Group' 'AbbVie' 'Abbott Laboratories'
 'Adobe Systems Inc' 'Analog Devices, Inc.' 'Archer-Daniels-Midland Co'
 'Ameren Corp' 'American Electric Power' 'AFLAC Inc'
 'American International Group, Inc.' 'Apartment Investment & Mgmt'
 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc'
 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp'
 'Applied Materials Inc' 'AMETEK Inc' 'Affiliated Managers Group Inc'
 'Ameriprise Financial' 'American Tower Corp A' 'AutoNation Inc'
 'Anthem Inc.' 'Aon plc' 'Amphenol Corp' 'Arconic Inc'
 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom'
 'American Water Works Company Inc' 'American Express Co' 'Boeing Company'
 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.'
 'BIOGEN IDEC Inc.' 'The Bank of New York Mellon Corp.' 'Ball Corp'
 'Bristol-Myers Squibb' 'Boston Scientific' 'BorgWarner'
 'Boston Properties' 'Caterpillar Inc.' 'Chubb Limited' 'CBRE Group'
 'Crown Castle International Corp.' 'Carnival Corp.' 'Celgene Corp.'
 'CF Industries Holdings Inc' 'Citizens Financial Group' 'Church & Dwight'
 'C. H. Robinson Worldwide' 'CIGNA Corp.' 'Cincinnati Financial'
 'Comerica Inc.' 'CME Group Inc.' 'Cummins Inc.' 'CMS Energy'
 'Centene Corporation' 'CenterPoint Energy' 'Capital One Financial'
 'The Cooper Companies' 'CSX Corp.' 'CenturyLink Inc'
 'Cognizant Technology Solutions' 'Citrix Systems' 'CVS Health'
 'Chevron Corp.' 'Dominion Resources' 'Delta Air Lines' 'Du Pont (E.I.)'
 'Deere & Co.' 'Discover Financial Services' 'Quest Diagnostics'
 'Danaher Corp.' 'The Walt Disney Company' 'Discovery Communications-A'
 'Discovery Communications-C' 'Delphi Automotive' 'Digital Realty Trust'
 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy'
 'DaVita Inc.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison'
 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'Equity Residential'
 'EQT Corporation' 'Eversource Energy' 'Essex Property Trust, Inc.'
 'E*Trade' 'Eaton Corporation' 'Entergy Corp.' 'Edwards Lifesciences'
 'Exelon Corp.' "Expeditors Int'l" 'Expedia Inc.' 'Extra Space Storage'
 'Fastenal Co' 'Fortune Brands Home & Security' 'FirstEnergy Corp'
 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems'
 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation'
 'Federal Realty Investment Trust' 'First Solar Inc' 'General Dynamics'
 'General Growth Properties Inc.' 'Gilead Sciences' 'Corning Inc.'
 'General Motors' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber'
 'Grainger (W.W.) Inc.' 'Halliburton Co.' 'Hasbro Inc.'
 'Huntington Bancshares' 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.'
 'Hartford Financial Svc.Gp.' 'Harley-Davidson' "Honeywell Int'l Inc."
 'Hewlett Packard Enterprise' 'HP Inc.' 'Hormel Foods Corp.'
 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company'
 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories'
 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group'
 'Iron Mountain Incorporated' 'Illinois Tool Works' 'Invesco Ltd.'
 'J. B. Hunt Transport Services' 'Jacobs Engineering Group'
 'Juniper Networks' 'Kimco Realty' 'Kansas City Southern'
 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding'
 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.'
 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.'
 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell'
 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l."
 'Masco Corp.' 'Mattel Inc.' "McDonald's Corp." "Moody's Corp"
 'Mondelez International' 'MetLife Inc.' 'Mohawk Industries'
 'Mead Johnson' 'McCormick & Co.' 'Martin Marietta Materials'
 'Marsh & McLennan' '3M Company' 'Altria Group Inc' 'The Mosaic Company'
 'Marathon Petroleum' 'Merck & Co.' 'M&T Bank Corp.' 'Mettler Toledo'
 'Mylan N.V.' 'Navient' 'NASDAQ OMX Group' 'NextEra Energy'
 'Newmont Mining Corp. (Hldg. Co.)' 'Nielsen Holdings'
 'Norfolk Southern Corp.' 'Northern Trust Corp.' 'Nucor Corp.'
 'Newell Brands' 'Realty Income Corporation' 'Omnicom Group'
 "O'Reilly Automotive" "People's United Financial" 'Pitney-Bowes'
 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.'
 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.'
 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services'
 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.'
 'Prudential Financial' 'Phillips 66' 'Quanta Services Inc.'
 'Praxair Inc.' 'PayPal' 'Ryder System' 'Royal Caribbean Cruises Ltd'
 'Robert Half International' 'Roper Industries' 'Republic Services Inc'
 'SCANA Corp' 'Charles Schwab Corporation' 'Sealed Air' 'Sherwin-Williams'
 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.'
 'Simon Property Group Inc' 'Stericycle Inc' 'Sempra Energy'
 'SunTrust Banks' 'State Street Corp.' 'Skyworks Solutions'
 'Synchrony Financial' 'Stryker Corp.' 'Molson Coors Brewing Company'
 'Tegna, Inc.' 'Torchmark Corp.' 'Thermo Fisher Scientific' 'TripAdvisor'
 'The Travelers Companies Inc.' 'Tractor Supply Company' 'Tyson Foods'
 'Tesoro Petroleum Co.' 'Total System Services' 'Texas Instruments'
 'Under Armour' 'United Continental Holdings' 'UDR Inc'
 'Universal Health Services, Inc.' 'United Health Group Inc.' 'Unum Group'
 'Union Pacific' 'United Parcel Service' 'United Technologies'
 'Varian Medical Systems' 'Valero Energy' 'Vulcan Materials'
 'Vornado Realty Trust' 'Verisk Analytics' 'Verisign Inc.'
 'Vertex Pharmaceuticals Inc' 'Ventas Inc' 'Wec Energy Group Inc'
 'Whirlpool Corp.' 'Waste Management Inc.' 'Western Union Co'
 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Wynn Resorts Ltd'
 'Xcel Energy Inc' 'XL Capital' 'Dentsply Sirona' 'Xerox Corp.'
 'Xylem Inc.' 'Yum! Brands Inc' 'Zimmer Biomet Holdings' 'Zions Bancorp'
 'Zoetis']

In cluster 2, the following companies are present:
['Alliance Data Systems' 'Amgen Inc' 'Chipotle Mexican Grill' 'Equinix'
 'Facebook' 'Frontier Communications' 'Intuitive Surgical Inc.'
 'Monster Beverage' 'Priceline.com Inc' 'Regeneron' 'Waters Corporation'
 'Yahoo Inc.']

In cluster 1, the following companies are present:
['Allegion' 'Apache Corporation' 'Chesapeake Energy'
 'Charter Communications' 'Colgate-Palmolive' 'Kimberly-Clark'
 'S&P Global, Inc.']

In cluster 5, the following companies are present:
['Alexion Pharmaceuticals' 'Amazon.com Inc' 'Netflix Inc.']

In cluster 4, the following companies are present:
['Anadarko Petroleum Corp' 'Baker Hughes Inc' 'Cabot Oil & Gas'
 'Concho Resources' 'Devon Energy Corp.' 'EOG Resources'
 'Freeport-McMoran Cp & Gld' 'Hess Corporation' 'Kinder Morgan'
 'Marathon Oil Corp.' 'Murphy Oil' 'Noble Energy Inc'
 'Newfield Exploration Co' 'National Oilwell Varco Inc.' 'ONEOK'
 'Occidental Petroleum' 'Range Resources Corp.' 'Spectra Energy Corp.'
 'Southwestern Energy' 'Teradata Corp.' 'Williams Cos.' 'Cimarex Energy']

In cluster 0, the following companies are present:
['Bank of America Corp' 'Citigroup Inc.' 'Ford Motor' 'Intel Corp.'
 'JPMorgan Chase & Co.' 'Coca Cola Company' 'Pfizer Inc.' 'AT&T Inc'
 'Verizon Communications' 'Wells Fargo' 'Exxon Mobil Corp.']

In [143]:
df2.groupby(["HC_segments", "GICS Sector"])['Security'].count()
Out[143]:
HC_segments  GICS Sector                
0            Consumer Discretionary          1
             Consumer Staples                1
             Energy                          1
             Financials                      4
             Health Care                     1
             Information Technology          1
             Telecommunications Services     2
1            Consumer Discretionary          1
             Consumer Staples                2
             Energy                          2
             Financials                      1
             Industrials                     1
2            Consumer Discretionary          2
             Consumer Staples                1
             Health Care                     4
             Information Technology          3
             Real Estate                     1
             Telecommunications Services     1
3            Consumer Discretionary         35
             Consumer Staples               15
             Energy                          7
             Financials                     44
             Health Care                    34
             Industrials                    52
             Information Technology         27
             Materials                      19
             Real Estate                    26
             Telecommunications Services     2
             Utilities                      24
4            Energy                         20
             Information Technology          1
             Materials                       1
5            Consumer Discretionary          1
             Health Care                     1
             Information Technology          1
Name: Security, dtype: int64
In [144]:
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")

for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(data=df2, x="HC_segments", y=variable)

plt.tight_layout(pad=2.0)

Hierarchical Clusters

Cluster 0

  • 11 stocks, comprised mostly of stocks within theConsumer Discretionary, Consumer Staples, Energy, Financials, Health Care, Information Technology, and Telecommunications Services.
  • Companies within this cluster have:
  • Low Volatility
  • Very low current Price
  • Low Return on Equity
  • Highest Net income
  • Highest Cash Flow
  • Highest Estimated Outstanding Shares

Cluster 1

  • 7 stocks, comprised mostly of stocks within the Consumer Discretionary, Consumer Staples, Energy, Financials, and Industrials

  • Companies within this cluster have:

  • Moderate Volatility
  • Low current Price
  • Highest Return on Equity
  • Low Net income
  • Moderate Cash Flow
  • Low Estimated Outstanding Shares

Cluster 2

  • 12 stocks Consumer Discretionary, Consumer Staples, Health Care, Information Technology,Real Estate, and Telecommunications Services.
  • Companies within this cluster have:
  • Moderate Volatility
  • Higest current Price
  • Very low Return on Equity
  • Low Net income
  • low Cash Flow

Cluster 3

  • 285 stocks, a vast majority of which are in the Consumer Discretionary, Consumer Staples, Energy, Financials, Health Care,Industrials, Information Technology, Materials, Real Estate, Telecommunications Services, and Utilities. Companies within this cluster have:
  • low Volatility
  • low current Price
  • Very low Return on Equity
  • Low Net income
  • low Cash Flow

Cluster 4

  • 23 stocks, from Energy,Information Technology,Materials and,Consumer Discretionary sectors. Companies within this cluster have:
  • Highest Volatility
  • low current Price
  • low Return on Equity
  • Very Low Net income
  • low Cash Flow

Cluster 5

  • 3 stocks, from Consumer Discretionary, Health Care, Information Technology Sectors. Companies within this cluster have:
  • High Volatility
  • High current Price
  • Highest price change
  • lowest Return on Equity
  • Low Net income
  • low Cash Flow

K-means vs Hierarchical Clustering

You compare several things, like:

  • Which clustering technique took less time for execution?
  • Which clustering technique gave you more distinct clusters, or are they the same?
  • How many observations are there in the similar clusters of both algorithms?
  • How many clusters are obtained as the appropriate number of clusters from both algorithms?

You can also mention any differences or similarities you obtained in the cluster profiles from both the clustering techniques.

Which clustering technique took less time for execution?

  • Both the KMeans model and the Agglomerative Clustering model successfully fit the dataset in approximately 0.1 seconds. Which clustering technique gave you more distinct clusters, or are they the same? How many observations are there in the similar clusters of both algorithms? Both algorithms yield comparable clusters, featuring a dominant cluster encompassing a majority of the stocks, while the remaining five clusters consist of 2 to 28 stocks each.

How many clusters are obtained as the appropriate number of clusters from both algorithms? In the case of both algorithms, the utilization of 6 clusters resulted in well-defined clusters, each containing an ample number of observations to effectively distinguish the representative "type" of stock within the cluster.

Differences or similarities in the cluster profiles from both the clustering techniques

  • Both algorithms produced analogous clusters by considering the outliers present across the 11 variables.

Actionable Insights and Recommendations

-

  • Trade&Ahead should initially assess their clients' financial objectives, risk tolerance, and investment patterns before suggesting a cluster that aligns with these requirements as a potential stock portfolio.

  • Nevertheless, a considerable number of these clusters, based on their constituent stock attributes, essentially serve as alternatives to standard indexes. Achieving these objectives could potentially be more straightforward using such indexes.

  • Alternatively, Trade&Ahead could employ these clusters as a foundational step for conducting comprehensive financial statement analysis. This approach could specifically highlight stocks that deviate from the cluster "profile."

  • If the process of handpicking individual stocks is integral to a client's investment approach, Trade&Ahead may then be well-positioned to identify stocks that are likely to outperform their counterparts (resulting in a buy recommendation) or underperform (prompting a sell recommendation).